ABSTRACT

We provide classifications for all 143 million non-repeat photometric objects in the Third Data
Release of the Sloan Digital Sky Survey (SDSS) using decision trees trained on 477,068 objects with
SDSS spectroscopic data. We demonstrate that these star/galaxy classifications are expected to be
reliable for approximately 22 million objects with r < 20. The general machine learning environment
Data-to-Knowledge and supercomputing resources enabled extensive investigation of the decision tree
parameter space. This work presents the first public release of objects classified in this way for an
entire SDSS data release. The objects are classified as either galaxy, star or nsng (neither star nor
galaxy), with an associated probability for each class. To demonstrate how to effectively make use
of these classifications, we perform several important tests. First, we detail selection criteria within
the probability space defined by the three classes to extract samples of stars and galaxies to a given
completeness and efficiency. Second, we investigate the efficacy of the classifications and the effect
of extrapolating from the spectroscopic regime by performing blind tests on objects in the SDSS,
2dF Galaxy Redshift and 2dF QSO Redshift (2QZ) surveys for which spectra are available. We find
that, for a sample giving a maximal combined completeness and efficiency, the completeness values
are 98.9% and 91.6% and the corresponding efficiencies are 98.3% and 93.5% for galaxies and stars,
respectively, in the SDSS. We also test our star-galaxy classification by studying the inverse of this
sample, finding quasars in the nsng sample with a completeness and efficiency of 88.5% and 94.5%
from the SDSS and 94.7% and 87.4% from the 2QZ. Given the photometric limits of our spectroscopic
training data, we effectively begin to extrapolate past our star-galaxy training set at r ~ 18. By
comparing the number counts of our training sample with the classified sources, however, we find that
our efficiencies appear to remain robust to r ~ 20. As a result, we expect our classifications to be
accurate for 900,000 galaxies and 6.7 million stars, and remain robust via extrapolation for a total
of 8.0 million galaxies and 13.9 million stars. The latter sample should prove useful in constructing
follow-up spectroscopic surveys. Characterizing the success of our classifications fainter than r ~ 20
will require fainter spectroscopic data.

Subject headings: astronomical data bases: miscellaneous; catalogs; methods: data analysis; surveys

1. INTRODUCTION

It is well-known that classification of objects in astro- 
nomical survey data provides a fundamental character- 
ization of the dataset, and forms a vital step in under- 
standing the ensemble properties of the dataset and the 
properties of individual objects. In this paper, we apply 
automated machine learning algorithms to classify the 
143 million non-repeat objects in the third data release 
(DR3; Abazajian et al. 2005) of the Sloan Digital Sky
Survey (SDSS^1; York et al. 2000). The SDSS is used because
the combination of the quality and accuracy of the
survey CCD photometry, the approximately concurrent 
spectroscopy performed with the same telescope, and the 
large number of objects, is ideally suited to the methods 
we employ in this work. 

Electronic address: nball@astro.uiuc.edu
^1 http://www.sdss.org

As our primary goal is to perform reliable star-galaxy
separation, we have used three classes: galaxy, star, and
nsng (neither star nor galaxy). The use of this third class
is novel, and is included to not only cleanly separate out 
those objects that clearly are neither stars nor galaxies 
(e.g., quasars), but to also improve the extrapolation of 
our classifications well beyond the nominal magnitude 
limits of our current training sample. The latter benefit 
arises since star-galaxy separation becomes less accurate 
at fainter magnitudes due to source confusion. In this 
case, our algorithm can assign sources to the third cate- 
gory when a clear star-galaxy classification is not feasi- 
ble, thereby minimizing contamination for lower signal- 
to-noise sources. 

In this paper, the learning algorithm we employ is the 
axis-parallel decision tree. A decision tree is trained on 
the subset of data for which spectra are available and 
is subsequently used to classify the much larger sample 
for which there is just photometry. This is the first public
release of objects classified using this method for the
entire SDSS DR3. A natural question is "Why provide
classification for all objects, most of which are near the 
photometric limits of the survey, where classifications are 
less robust?" We have three reasons for classifying the 
entire SDSS DR3. First, given the speed of applying 
our classifications, we have found that it is considerably 
easier to handle large datasets when no artificial cuts are 
used. With the growth in the number of large astronomical
surveys that are accessible from a virtual observatory,
we feel this point is becoming increasingly important,
as it greatly simplifies any data federation a fellow
scientist may need to perform if they wish to use
our classifications. Second, classifying all objects allows
others with fainter training data to verify the efficacy of 
our classifications. Finally, since our classifications are 
probabilistic in nature, by providing the classifications 
for all objects, we allow others to make their own prob- 
ability cuts, perhaps using combinations of classes, to 
perform their own science. For example, our approach 
simplifies the task of performing follow-up spectroscopy, 
as a user of our fully classified catalog has an easier task 
in characterizing any selection effects. 

Decision tree learning algorithms have been in existence
since the 1970s (see, e.g., Breiman et al. 1984
or Quinlan 1986). They are able to classify objects
into discrete or continuous categories, scale well to large
datasets, are fairly robust to inequitably sampled training
sets, and can cope with values from individual objects
that happen to be bad or irrelevant for the training targets
being considered. Axis-parallel decision trees, particularly
smaller ones, are also easily understood, providing
clear criteria for the splitting of the data set into constituent
classes or values. This is in contrast to, for example,
artificial neural networks (e.g., Ball et al. 2004),
where the resulting weights are basically a black box, or
oblique decision trees (e.g., Murthy et al. 1994), where
linear combinations of parameters, or hyperplanes, are
used to subdivide the data. Given the size of our dataset,
however, our trees are generally rather large, making it
difficult to fully interpret the classification rules, but one
can in principle follow the splitting of the population.
While oblique trees may be less transparent in their operation,
they do have the benefit of generally producing
smaller trees. Most astronomical results using neural
networks and decision trees show that the two methods
give comparable results; an example is Bazell & Aha
(2001), described below.

A number of previous efforts have used decision trees to
classify objects in astronomical surveys. Salzberg et al.
(1995) use decision trees to separate stars and cosmic
ray hits in Hubble Space Telescope images and are able
to do so with an accuracy of over 95%. Their algorithm,
known as OC1, is an oblique decision tree. Owens et al.
(1996) use a version of OC1 to classify galaxy morphologies
in the European Southern Observatory-Uppsala
surface photometry catalog (Lauberts & Valentijn 1989)
and show their results to be comparable to those of
Storrie-Lombardi et al. (1992), who use artificial neural
networks on the same data.

White et al. (2000) construct a radio-selected sample
of quasars using the VLA-FIRST survey (Becker et al.
1995) and the optical APM-POSS-I quasar survey of
McMahon & Irwin (1992). They show that using decision
trees improves on simpler methods and enables
quasars to be selected with 85% efficiency. They also
show the completeness and efficiency of the selection as
a function of the threshold probability from the decision tree
that the object is a quasar. Bazell & Aha (2001) compare
Naive Bayes, decision tree and neural network classifiers
for morphological galaxy classification and show
that the latter two are comparable.

Suchkov et al. (2005) take advantage of the much improved
data, both in quality and quantity of objects,
available from the SDSS. They apply the oblique decision
tree classifier ClassX, based on OC1, to the SDSS DR2
(Abazajian et al. 2004) to classify objects into star, red
star (type M or later), AGN photometric redshift and 
galaxy photometric redshift bins. Three representative 
samples, each of 10^ photometric objects, one of which 
extends 2 magnitudes fainter than the spectroscopy, are 
also investigated. They give a table of percentage correct 
classifications from which one can calculate completeness 
and efficiency. 

Many studies, too numerous to list here, use other 
machine learning techniques to classify astronomical 
survey objects, including star-galaxy separation and 
quasar identification. These previous works, however, do 
not provide spectroscopically-trained classifications with 
probabilities on the scale of the 143 million sources given 
here for the SDSS DR3. As part of our work, we make 
publicly available^2 the full set of our 143 million object
classifications for the SDSS DR3 and probability cuts to 
select galaxies and stars for a range of completenesses 
and efficiencies. 

The rest of the paper is organized as follows. In §2 we describe
the datasets used in the construction and testing of the
catalog. §3 describes the decision trees, their optimization
using supercomputing resources, and the probability
cuts to create subsamples of particular objects. §4
describes the optimal learning parameters found and the
resulting datasets. In §5 we discuss the efficacy of our
star-galaxy separation, the extrapolation of our results
to fainter magnitudes, and new projects we have undertaken
to extend this work. We conclude in §6.

2. DATA 

The SDSS is a project to map π steradians of
the northern Galactic cap in five bands (u, g, r,
i and z) from 3500-8900 Å. The final survey will
provide photometry for approximately 2.5 × 10^8 objects,
of which a multifiber spectrograph will provide
spectra and redshifts for around one million.
The spectroscopic targeting pipeline selects targets for
spectroscopy from the imaging (Eisenstein et al. 2001;
Richards et al. 2002; Strauss et al. 2002), and a tiling
algorithm (Blanton et al. 2003) optimally assigns spectroscopic
fibers to the selected targets. Further details
of the survey are given in York et al. (2000) and in
Abazajian et al. (2005) for DR3.

The datasets we use in this work consist of training,
testing^3, blind testing, and application sets. The
training and testing sets are drawn from objects in
the specObj view of the DR3 Catalog Archive Server
(CAS) public database^4. The blind testing sets are from
the SDSS DR3, 2dF Galaxy Redshift Survey (2dFGRS;
Colless et al. 2003) final data release^5 and the 2dF QSO
Redshift Survey (2QZ; Croom et al. 2004) final data release^6.
The application set is a locally hosted version
of the photoPrimary view of the DR3 CAS. As with
specObj, this is the view which contains the primary
SDSS observations with no duplicates. The raw specObj
data consists of 477,081 objects and the photoPrimary
data 142,705,734^7.

^2 http://quasar.astro.uiuc.edu/rml

^3 Note that some papers in the astronomy literature use the
term 'validation set' to refer to what is here described, in the usual
machine-learning sense, as the testing set. The validation set can
be used, for example, in the training of artificial neural networks
to determine when to stop training in order to avoid overfitting.

The training, testing and application data consist of 
SDSS photometric and spectroscopic attributes available 
in the photoPrimary and specObj tables. We perform
no additional image or spectral reduction. Since we clas- 
sify all sources in the SDSS DR3, this allows a user of 
our classifications to apply their own sample cuts based 
on SDSS flags or attribute errors. The photometric attributes
used for each object are the colors u - g, g - r,
r - i, and i - z, corrected for Galactic reddening via the
maps of Schlegel et al. (1998). The colors in each of the
four magnitude types PSF, fiber, Petrosian and model 
are used. The use of the four magnitude types allows 
star-galaxy separation to be, to some extent, built-in. 
The light profiles of galaxies and stars are very differ- 
ent, which can be quantified by measuring the object 
flux in different-sized apertures. This fact is often used 
to perform star-galaxy separation, including within the 
SDSS photometric pipeline where the difference between 
psfMag and cModelMag is used (see §3 for more information).
The spectroscopic attribute is the target object
type and is given by the specClass value in specObj assigned
by the SDSS spectroscopic pipeline.

The specClass is discrete and takes the values 
unknown, star, galaxy, qso, qso_hiz or star_late. We
construct our three training samples: galaxies, stars, and
nsng, from these classes according to the mappings shown
in Table 1. Resulting numbers of each are 361,502
galaxies, 62,333 stars and 53,246 nsng, giving a total of 
477,081 objects available for the training and testing pro- 
cess. Given the spectroscopic data available in the SDSS, 
the nsng training sample will predominantly consist of 
quasars. Decision trees, however, effectively utilize all 
of their training data. Thus, at fainter magnitudes, we 
can expect the nsng sample to contain sources that are 
not just quasars, but unknowns in conjunction with the 
unknown category in the specClass. 

On the other hand, decision trees are also robust to 
outlying values in their training data. For the classi- 
fication training we simply apply the very broad cuts 
-40 < color < 40 to remove clearly unphysical values
from the data such as -9999. These are applied to each 
of the colors and resulted in the removal of 12 galaxies 
and 1 quasar, leaving 477,068 for training and testing. 
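
The broad color cut described above can be sketched as follows. This is only an illustration: the record layout (a dictionary with a `colors` list) and the function names are assumptions made here, not part of the SDSS schema or of the authors' pipeline.

```python
# Bounds of the broad cut quoted in the text.
COLOR_MIN, COLOR_MAX = -40.0, 40.0

def passes_color_cut(colors):
    """Return True if every color lies strictly inside (-40, 40)."""
    return all(COLOR_MIN < c < COLOR_MAX for c in colors)

def apply_color_cut(objects):
    """Keep only objects whose colors are all physically plausible."""
    return [obj for obj in objects if passes_color_cut(obj["colors"])]

# Hypothetical records; -9999 is the kind of sentinel value the cut removes.
sample = [
    {"id": 1, "colors": [1.2, 0.6, 0.3, 0.1]},
    {"id": 2, "colors": [-9999.0, 0.5, 0.2, 0.1]},
]
kept = apply_color_cut(sample)
print([obj["id"] for obj in kept])
```

Because the cut is so wide, it only rejects sentinel or corrupted values and leaves genuine outliers for the decision tree to handle.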

For the decision tree optimization, sets of objects were 
randomly drawn from the remaining specObj set and 

^4 http://cas.sdss.org/astrodr3/en/
^5 http://www.mso.anu.edu.au/2dFGRS/
^6 http://www.2dfquasar.org

^7 A previous version of the DR3 CAS had a slightly different
total number, 142,759,806, traceable to objects in run 2248.

^8 There are also the classes sky and gal_em (emission-line galaxy)
in the DR3 database, but there are no examples of the former in
specObj and the latter is a placeholder.



used in training, with the rest for testing. Thus the test- 
ing set, in a strict sense, is not independent of the train- 
ing set. However, the large numbers of objects available 
in the training data means that the random subsamples 
drawn, except perhaps the smallest ones at a level of 
10% or less, are large enough to still be truly represen- 
tative of the training data and should thus be effectively 
independent. 

The performance results quoted in §4 are, nevertheless,
from a tree trained and tested on 80% of the training 
data and applied to the other 20%. The 80% is subdi- 
vided as above into non-overlapping training and test- 
ing sets, also in the ratio 80:20. Given the hypothesized 
easily-adequate size of the training set, this pseudo-blind 
test should be very similar to the test results during the 
optimization. 
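
The nested 80:20 partition described above (64% training, 16% testing, 20% held back for the pseudo-blind test) can be sketched as below. The function name, the use of Python's `random` module, and the fixed seed are assumptions for illustration only.

```python
import random

def three_way_split(items, seed=0):
    """Split items into training, testing and pseudo-blind sets (64/16/20)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_blind = round(0.20 * len(shuffled))
    blind = shuffled[:n_blind]              # never seen during training
    development = shuffled[n_blind:]        # the remaining 80%
    n_test = round(0.20 * len(development))
    test = development[:n_test]             # 20% of the 80%
    train = development[n_test:]            # 80% of the 80%
    return train, test, blind

train, test, blind = three_way_split(range(1000))
print(len(train), len(test), len(blind))
```

The three sets are disjoint by construction, so the held-back 20% plays no part in choosing the splits or in scoring them during training.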

For the final application to the photoPrimary data, 
the tree was trained on all of the training data to maxi- 
mize the available information. Thus the results quoted 
in §4 may be slightly pessimistic, but again they are
likely to be very similar to the performance of the tree 
on photoPrimary, within the spectroscopic regime. For 
the vast majority of objects in the photoPrimary table, 
no spectroscopic information is available. In addition, we 
do not restrict the application dataset, as each object is 
classified independently. Therefore, specific cuts, either 
on SDSS flags, attribute errors, or the attributes them- 
selves can be later applied by a user as appropriate to 
create a particular sample. For example, one might only 
want 'best' photometry, as indicated by the object flags, 
for comparing object populations, or one might want to 
apply a color cut to enhance the utility of the provided 
classifications. 

The blind testing sets are drawn from the 2dFGRS 
and the 2QZ. These test the performance of the decision 
tree on non-SDSS data in a parameter space which ex- 
tends beyond the SDSS DR3 spectroscopic regime. The 
2dFGRS was used to test the best tree's galaxy clas- 
sifications and the 2QZ to test its star and nsng clas- 
sifications. The surveys were matched with the SDSS 
using object position with a tolerance of 2 arcseconds 
and their spectroscopic classifications were taken as the 
target types. Of the 2dFGRS galaxies, 50,191 were 
matches. For the 2QZ the results were compared to the
objects assigned the category 11 (best classification and
redshift; see Croom et al. 2004), comprising 8,739 quasars
and 5,193 stars, and those assigned 12, 21 or 22 (second-best;
9,273 quasars).
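
A positional match with a 2 arcsecond tolerance can be sketched as follows. This is a brute-force illustration under stated assumptions: coordinates are (RA, Dec) pairs in degrees, the separation uses the haversine formula, and the function names are hypothetical. A production cross-match of catalogs this size would use a spatial index instead of a double loop.

```python
import math

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds for positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0

def match_catalogs(sdss, external, tolerance_arcsec=2.0):
    """Return (external_index, sdss_index) pairs for nearest matches within tolerance."""
    matches = []
    for j, (ra_e, dec_e) in enumerate(external):
        best_i, best_sep = None, tolerance_arcsec
        for i, (ra_s, dec_s) in enumerate(sdss):
            sep = angular_separation_arcsec(ra_s, dec_s, ra_e, dec_e)
            if sep <= best_sep:
                best_i, best_sep = i, sep
        if best_i is not None:
            matches.append((j, best_i))
    return matches

# Made-up positions: the first external source lies about 1 arcsec from sdss[0].
sdss = [(150.0, 2.0), (150.1, 2.1)]
external = [(150.0 + 1.0 / 3600.0, 2.0), (151.0, 3.0)]
print(match_catalogs(sdss, external))
```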

3. ALGORITHMS 

Star-galaxy separation in the SDSS is currently
performed within the SDSS photometric pipeline
(Lupton et al. 2001). An object is classified as extended,
and hence a galaxy, if psfMag - cmodelMag > 0.145.
Here psfMag is the point-spread-function magnitude and
cmodelMag is a combination of the best fitting de Vaucouleurs
and exponential profiles. The separation is done
for each band individually, and for all five bands combined.
psfMag is described further in Stoughton et al.
(2002) and cmodelMag in Abazajian et al. (2004). Additional
star-galaxy classifications have been performed
on SDSS data, e.g., Scranton et al. (2002) perform a
Bayesian star-galaxy separation via differences between
the r band psfMag and modelMag.

In our approach, we use an axis-parallel decision tree
to assign a probability that a source belongs to each of 
three classes. These probabilities always sum to one and 
reflect the relative degree of certainty about an object's 
type. We provide cuts in these probabilities that can be 
applied to generate catalogs to a required completeness 
and efficiency of either galaxies, stars, or neither galaxies
nor stars (i.e., nsng). In general, the cuts can be a func- 
tion of the probabilities, and these functions can have 
both minimum and maximum values. 

We define both completeness and efficiency as functions
of the classification probability. For a given sample,
the completeness is the fraction of all catalog
sources of a given type that are selected at a specific
probability threshold. Likewise, the efficiency is the fraction
of the catalog sources selected at that threshold that are
truly of the given type. In previous astronomical machine
learning results, the efficiency of a sample has also been
called the reliability or purity, and 1 - efficiency has
been called the contamination. The quantities are also
known (e.g., Witten & Frank) as the recall and precision.
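
As a concrete illustration of these two definitions, the sketch below computes completeness (recall) and efficiency (precision) for one class at a given probability threshold. The function name, the dictionary layout of the probabilities, and the sample values are assumptions made for this example.

```python
def completeness_efficiency(true_labels, probabilities, target, threshold):
    """Completeness and efficiency for `target`, selecting p[target] >= threshold."""
    selected = [p[target] >= threshold for p in probabilities]
    is_target = [label == target for label in true_labels]
    n_selected = sum(selected)
    n_target = sum(is_target)
    n_correct = sum(s and t for s, t in zip(selected, is_target))
    completeness = n_correct / n_target if n_target else float("nan")
    efficiency = n_correct / n_selected if n_selected else float("nan")
    return completeness, efficiency

# Made-up spectroscopic labels and tree probabilities for five objects.
labels = ["galaxy", "galaxy", "star", "nsng", "galaxy"]
probs = [
    {"galaxy": 0.95, "star": 0.03, "nsng": 0.02},
    {"galaxy": 0.60, "star": 0.30, "nsng": 0.10},
    {"galaxy": 0.85, "star": 0.10, "nsng": 0.05},
    {"galaxy": 0.05, "star": 0.05, "nsng": 0.90},
    {"galaxy": 0.90, "star": 0.05, "nsng": 0.05},
]
print(completeness_efficiency(labels, probs, "galaxy", 0.8))
```

In this toy sample three objects pass the cut, two of which are true galaxies, and three true galaxies exist in total, so both quantities equal 2/3. Raising the threshold generally trades completeness for efficiency.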

As well as a maximal combination of completeness and 
efficiency, one may also desire maximal values of these 
measures separately. An example of the former is to in- 
clude interesting or unusual objects of a certain class, for 
example, candidate objects for gravitational lensing. An 
example of the latter is applications such as the 2-point 
correlation function of extragalactic objects, where one 
wants to minimize contamination by non-cosmological
objects such as Galactic stars (e.g., Myers et al. 2006).

Our decision tree methods are applied within
the framework of the Data-to-Knowledge Toolkit^9
(Welge et al. 1999), developed and maintained by the
Automated Learning Group at the National Center for
Supercomputing Applications (NCSA). This is a Java-based
application which allows numerous modules, each
of which performs a single data processing task, to be
interconnected in a variety of ways. The framework allows
the easy addition of further data to the training and
testing process and the application of the trained decision
trees to the entire photometric catalog. The trees are implemented
on the Xeon Linux Cluster tungsten at NCSA.
This nationally allocated supercomputing system is composed
of 1280 compute nodes. Each node has 2 GB of
memory available and a peak double-precision performance
of 6.4 Gflops.

Our use of the tungsten supercomputing system was
facilitated by a large, peer-reviewed, national allocation
submitted by the Laboratory for Cosmological Data Mining
at NCSA, for which RJB is the principal investigator.
Execution tasks were submitted to tungsten via a shell
script that created and tested one or more decision trees
whose testable parameters were specified at run time.
One or more decision trees would be created and tested
on a single tungsten node, and when completed the test
results for each decision tree were returned for subsequent
analysis. As this process was embarrassingly parallel, we
submitted multiple execution tasks simultaneously from
this single shell script, with a maximum of 240 submitted
at any one time. As discussed in the next section,
we analyzed the results of over seven thousand decision
trees, which spanned a range of different decision tree
parameters, in selecting our optimal parameter values.

^9 http://alg.ncsa.uiuc.edu/do/tools/d2k

3.1. Decision Trees 

Decision trees consist first of a root node in which the 
parameters describing the objects in the training set pop- 
ulation are input, along with the classifications. The tree 
is usually considered upside-down, with the root node at 
the top. Here there are three classes, so for each object 
we have 

$$ \mathrm{input} = ([f_1, f_2, f_3, \ldots, f_{16}], [o_g, o_n, o_s]) \qquad (1) $$

for the 16 features f (four colors in four magnitude types)
and the three output classes o. The target outputs are [1,0,0],
[0,1,0] and [0,0,1] for galaxy, nsng, and star, respectively.
The predicted outputs are in the form

$$ \mathrm{output} = [p(o_g), p(o_n), p(o_s)], \qquad (2) $$

and the predicted probabilities of class membership al- 
ways sum to one. 
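
The target encoding and the constraint on the predicted output can be written down in a few lines. This is a minimal sketch; the helper names are hypothetical and the class order simply follows the ordering (galaxy, nsng, star) given in the text.

```python
CLASSES = ("galaxy", "nsng", "star")  # order of the output vector in the text

def one_hot(label):
    """Target vector for a spectroscopic class, e.g. 'nsng' -> [0, 1, 0]."""
    if label not in CLASSES:
        raise ValueError(f"unknown class: {label!r}")
    return [1 if label == c else 0 for c in CLASSES]

def is_valid_output(probabilities, tolerance=1e-9):
    """Check that a predicted output is a probability vector summing to one."""
    return (len(probabilities) == len(CLASSES)
            and all(0.0 <= p <= 1.0 for p in probabilities)
            and abs(sum(probabilities) - 1.0) <= tolerance)

print(one_hot("nsng"), is_valid_output([0.9, 0.08, 0.02]))
```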

A node population is split into population groups that 
are assigned to child nodes using the criterion that min- 
imizes the classification error. This is a measure of how 
good the classification is when the tree that would re- 
sult from the split is run on the test set. The process 
is repeated iteratively, resulting in a number of layers 
of nodes that form a tree. The iteration stops at a node
when its population reaches the minimum allowed population
of objects in a node (the minimum decomposition population;
MDP), when the number of nodes between it and
the root node reaches the maximum tree depth (MTD),
or when the population split no longer decreases
the classification error by a minimum set amount
(the minimum error reduction; MER). The nodes from
which no further nodes branch are the leaf nodes.

The computational complexity of the algorithm is
O(nmd), where n is the number of examples, m is the
number of features, and d is the depth of the tree. Here n
is O(10^5), whereas m and d are O(10), giving a product
of O(10^7).

The split is tested for each input feature. We use axis-parallel
splits, in which, following Owens et al. (1996),

$$ X_i > k, \qquad (3) $$

where X_i is the ith attribute for example X, and k is a
value to be tested. An alternative, which we have not
utilized in this work, is the oblique split

$$ \sum_{i=1}^{d} a_i X_i + a_{d+1} > 0 \qquad (4) $$

for d attributes. This allows hyperplanes at arbitrary
angles in the parameter space. Here we allow the split 
point to be either the midpoint, the mean, or the median 
of the values of the parameters for the node population, 
with the best of these chosen for each individual node. 
With the removal of extreme non-physical outliers such 
as -9999 from the training data, selecting the optimal 
statistic from those measured at each node splitting pro- 
vides better results than consistently selecting the same 
statistic. 

The classification error we used is the variance

$$ \sum_{i=1}^{3} (A_i - P_i)^2, \qquad (5) $$

where i is the output feature index, A is the actual output
feature value (0 or 1), and P is the probability prediction
made by the decision tree. The data is modeled
using the mean of all the output vectors, and the result
is interpreted as a probability over class space.
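
The growth procedure of this section can be sketched compactly. This is a minimal illustration, not the Data-to-Knowledge implementation: the function names and the dictionary node layout are hypothetical, splits are scored on the training population rather than on a separate test set, and the exact stopping semantics of MDP, MTD and MER are assumptions.

```python
from statistics import mean, median

def node_probabilities(targets):
    """Mean of the one-hot target vectors in a node, read as class probabilities."""
    n = len(targets)
    return [sum(t[c] for t in targets) / n for c in range(len(targets[0]))]

def variance_error(targets):
    """Sum over objects and classes of (A - P)^2, with P the node mean."""
    p = node_probabilities(targets)
    return sum((t[c] - p[c]) ** 2 for t in targets for c in range(len(p)))

def best_split(features, targets):
    """Try the midpoint, mean and median of every feature; keep the lowest error."""
    best = None
    for j in range(len(features[0])):
        values = [x[j] for x in features]
        candidates = ((min(values) + max(values)) / 2.0, mean(values), median(values))
        for k in candidates:
            left = [t for x, t in zip(features, targets) if x[j] <= k]
            right = [t for x, t in zip(features, targets) if x[j] > k]
            if not left or not right:
                continue  # a split must send objects to both children
            error = variance_error(left) + variance_error(right)
            if best is None or error < best[0]:
                best = (error, j, k)
    return best

def build_tree(features, targets, depth=0, mdp=2, mtd=16, mer=0.0):
    """Grow an axis-parallel tree, stopping on MDP, MTD or MER (assumed semantics)."""
    node = {"p": node_probabilities(targets)}
    if len(targets) < mdp or depth >= mtd:
        return node
    split = best_split(features, targets)
    if split is None or variance_error(targets) - split[0] <= mer:
        return node
    _, j, k = split
    left = [(x, t) for x, t in zip(features, targets) if x[j] <= k]
    right = [(x, t) for x, t in zip(features, targets) if x[j] > k]
    node["feature"], node["threshold"] = j, k
    node["left"] = build_tree(*zip(*left), depth + 1, mdp, mtd, mer)
    node["right"] = build_tree(*zip(*right), depth + 1, mdp, mtd, mer)
    return node

def predict(node, x):
    """Follow the splits down to a leaf and return its class probabilities."""
    while "feature" in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["p"]

# Toy data: one feature, targets ordered as [galaxy, nsng, star].
features = [[0.1], [0.2], [0.9], [1.0]]
targets = [[0, 0, 1], [0, 0, 1], [1, 0, 0], [1, 0, 0]]
tree = build_tree(features, targets)
print(predict(tree, [0.15]), predict(tree, [0.95]))
```

Because each leaf stores the mean of its one-hot targets, the returned vector sums to one and small leaves yield a discrete set of probability values.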

We used n-fold bagging (bootstrap-aggregating) in ad- 
dition to the optimized tree and node parameters. In this 
procedure, the testing set is a randomly sampled frac- 
tion (bagging fraction) of the original testing set, and 
the procedure is repeated n times, creating n decision 
trees. The results are then voted on, using simple ma- 
jority voting. Similarly, the testing set itself is randomly 
subsampled from the training set, a procedure known as 
cross-validation. We perform 10-fold bagging, but use 
1-fold cross-validation, as only the former improved the 
results. 
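
The bagging and voting steps can be illustrated with a toy base learner standing in for the decision tree. Everything here is an assumption for illustration: the one-feature threshold "stump", the sampling of each bag without replacement from the training data, and the function names. Only the overall pattern (n models on random fractions, then simple majority voting) follows the text.

```python
import random
from collections import Counter

def train_stump(sample):
    """Toy base learner: threshold halfway between the two class means."""
    stars = [x for x, label in sample if label == "star"]
    galaxies = [x for x, label in sample if label == "galaxy"]
    if not stars or not galaxies:
        only = "galaxy" if galaxies else "star"
        return lambda x: only
    threshold = (sum(stars) / len(stars) + sum(galaxies) / len(galaxies)) / 2.0
    return lambda x: "galaxy" if x > threshold else "star"

def bagged_models(data, n_models=10, bagging_fraction=0.8, seed=0):
    """Train n_models learners, each on a random fraction of the data."""
    rng = random.Random(seed)
    size = max(1, int(bagging_fraction * len(data)))
    return [train_stump(rng.sample(data, size)) for _ in range(n_models)]

def majority_vote(models, x):
    """Simple majority vote over the individual predictions."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

# Made-up one-dimensional feature, larger for extended sources.
data = [(0.00, "star"), (0.05, "star"), (0.10, "star"), (-0.05, "star"), (0.02, "star"),
        (0.90, "galaxy"), (1.20, "galaxy"), (0.60, "galaxy"), (1.50, "galaxy"), (0.80, "galaxy")]
models = bagged_models(data, n_models=10)
print(majority_vote(models, 1.0), majority_vote(models, 0.05))
```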

The parameters of the decision tree, and additional 
methods such as bagging, can have a substantial effect 
on the quality of the results produced. The massively 
parallel supercomputing resources we used in this work 
allowed extensive investigation of this parameter space, 
including: 1) the minimum decomposition population, 
2) the maximum tree depth, 3) the minimum error re- 
duction, 4) the method of splitting the population at a 
node, either half-and-half, midpoint, mean or median, 5) 
the ratio of sizes of the training set to the testing set, 
6) the bagging fraction, and 7) the random seed used 
in selecting the subsamples for the testing set and the 
bagging. In addition, the construction of thousands of 
trees and the use of advanced visualization allowed us to 
investigate the interaction of these parameters. 

3.2. Probabilities 

Once we have constructed an optimal decision tree,
given the available training data, we can characterize
the best probability thresholds to construct samples from
further data to the desired level of completeness or efficiency.
We investigated the basic probabilities P(galaxy),
P(nsng) and P(star) (hereafter p_g, p_n, and p_s) and the
ratios

$$ f(p) = \log(p_g/p_s),\ \log(p_g/p_n),\ \mathrm{and}\ \log(p_n/p_s) \qquad (6) $$

for each target type galaxy, star and nsng for binned
minima and/or maxima, using the 80:20 pseudo-blind
test described in §2.
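
A cut on one of these ratios with a minimum and/or maximum can be sketched as follows. The base of the logarithm is not stated in the text, so base 10 is assumed here. The treatment of zero probabilities and the function names are also choices made for this example.

```python
import math

def log_ratio(p_num, p_den):
    """log10(p_num / p_den); infinite when exactly one probability is zero."""
    if p_num == 0.0 and p_den == 0.0:
        return math.nan
    if p_den == 0.0:
        return math.inf
    if p_num == 0.0:
        return -math.inf
    return math.log10(p_num / p_den)

def select(objects, statistic, minimum=-math.inf, maximum=math.inf):
    """Keep objects whose statistic lies in [minimum, maximum]."""
    return [obj for obj in objects if minimum <= statistic(obj) <= maximum]

# Each object is (p_g, p_n, p_s); values are made up for illustration.
objects = [(0.90, 0.09, 0.01), (0.40, 0.10, 0.50), (1.00, 0.00, 0.00)]
galaxy_like = select(objects, lambda p: log_ratio(p[0], p[2]), minimum=1.0)
print(galaxy_like)
```

Objects for which the ratio is undefined (both probabilities zero) fail every comparison and are therefore never selected.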

Small numbers of objects in the leaf nodes resulted in 
discretization in the output probabilities from the deci- 
sion tree. Therefore, care was taken when comparing the 
resulting mixture of exact and floating point values to 
bin edges to ensure that the counts were correct and not 
offset due to finite floating point precision. The accuracy 
was achieved by using a Decimal datatype, which stores 
numbers as exact values to a specified number of decimal 
places. 
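
The floating point pitfall and its fix can be shown with Python's standard `decimal` module. The text does not say which Decimal implementation was used, so this is an assumed stand-in; the bin width, the number of decimal places, and the function names are likewise illustrative.

```python
from decimal import Decimal

def to_decimal(p, places=6):
    """Convert a float probability to an exact value with a fixed number of places."""
    return Decimal(repr(p)).quantize(Decimal(10) ** -places)

def bin_index_float(p, width=0.1):
    """Naive bin index using binary floating point."""
    return int(p / width)

def bin_index_decimal(p, width="0.1", places=6):
    """Bin index computed with exact decimal arithmetic."""
    return int(to_decimal(p, places) // Decimal(width))

# A leaf holding 10 objects, 3 of them galaxies, gives exactly 3/10.
p = 3 / 10
print(bin_index_float(p), bin_index_decimal(p))
```

With binary floats, 0.3 / 0.1 evaluates to slightly less than 3, so the object lands in the bin below the one it belongs to. The decimal version places it on the bin edge as intended.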

4. RESULTS 

4.1. Decision Tree Parameter Optimization 

As described in §3.1, a decision tree has a number of
parameters that can be adjusted. The product of these
parameters defines an enormous parameter space of possible
trees that must be explored to identify what we
expect to be the optimal decision tree. In addition, this
parameter space can potentially include many local minima
of the classification error, further complicating this
process. Given the size of the parameter space and the
computational time required to construct and test an individual
decision tree, we utilized the NCSA tungsten
cluster (described in more detail in §3) to perform an extensive
exploration of this parameter space. All decision
trees were built from the same training set with the 16
input parameters described in §2, and the full parameter
space we explored is described in Table 2.

First, we studied the interaction of the minimum node 
decomposition population (MDP), the maximum tree 
depth (MTD) and the minimum error reduction (MER) 
for splitting the population. The values investigated were 
MDP = 2^a, where 0 ≤ a ≤ 15, 1 ≤ MTD ≤ 20, and
MER = 10^b, where -6 ≤ b ≤ 6, and MER = 0. Negative
MERs do not split the population. This gave 4,480 combinations,
and for each one a decision tree was generated.
Figure 1 shows a Partiview (Levy 2001) visualization of
the results. The best tree had an MDP of 2 and an MER
of 0. MTD was limited to 16 by the combination of the
size of the training set and the memory available on tungsten
but, as shown in Figure 1, a deeper tree is unlikely
to give a substantial improvement.
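
The size of this grid can be checked by enumerating it. The sketch assumes integer exponents with inclusive bounds (MDP = 2^a for a = 0 to 15, MTD = 1 to 20, MER = 10^b for b = -6 to 6 plus MER = 0), an assumption that reproduces the quoted total of 4,480 combinations.

```python
from itertools import product

mdp_values = [2 ** a for a in range(0, 16)]              # 16 values: 1, 2, ..., 32768
mtd_values = list(range(1, 21))                          # 20 values: 1, ..., 20
mer_values = [0.0] + [10.0 ** b for b in range(-6, 7)]   # 14 values: 0 and 1e-6, ..., 1e6

grid = list(product(mdp_values, mtd_values, mer_values))
print(len(mdp_values), len(mtd_values), len(mer_values), len(grid))
```

Each tuple in `grid` is one (MDP, MTD, MER) setting, which makes the search embarrassingly parallel: every setting can be trained and tested independently.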

We next varied the statistic used to split the population 
at the nodes: halving the population, using the midpoint 
of each input, the mean of each input, and the median. 
The 16 combinations give very similar results, with the 
exception of those which allow population halving, which 
are significantly worse. As a result, we allowed the al- 
gorithm to select, when splitting each node, the optimal 
statistic from the midpoint, mean, and median split val- 
ues. 

In all previous measurements, we had set the level of 
bagging used when building the decision tree at 20-fold 
and assumed a ratio of the fraction of the training set 
used in the bagging to the fraction of the sample used in 
training of 80:20 (i.e., 80% for training to 20% for testing).
We conducted a number of tests to verify these assumptions.
When varying the level of bagging, we found
that the classification error is substantially reduced when 
the bagging level is increased from 1- to 10-fold, and 
continues toward an asymptotic value slightly below the 
value seen for the 50-fold case. As with our tests in- 
volving the maximum tree depth, the number of models 
was limited by the available memory on each node of 
the tungsten cluster, although given the asymptotic ten- 
dency, we do not expect that the results would change 
substantially for larger levels of bagging. 

We cross-tested the training to testing ratios with the 
n-fold bagging test, finding the best results were obtained 
using a 50-fold model with an 80% training subsample, 
as shown in Figure 2. Given memory limitations, the
actual trees used in our other parameter tests were al- 
ways limited to a 20-fold bagging. When performing our 
final classifications, however, we used the more accurate 
50-fold bagging. In all subsequent tests we set the train- 
ing to testing fractions to 80:20. We also note that the 
bagging prevents an overfitting of the training set: for 
low values of bagging, such as 1-fold, the classification 
error has a minimum at a particular MTD, and increases 
substantially for deeper trees. 

Finally, the effect of the random seed in choosing a
same-sized training set was investigated and is shown in
Figure 3. The 1σ variation in the classification error
is approximately 0.1%. Given this variation, we quote
the best classification error achieved as being (3.07 ±
0.08)%. This is a robust result as the classification error
shows a broad, approximately flat, minimum in several
areas of the parameter space. See, for example, Figure
2, where the flat area corresponds to approximately 3%
classification error.

Although our exploration of the parameter space was 
extensive, the number of possible trees in the space 
quickly becomes very large with the variation of all parameters,
so it is always possible that a better tree exists than the one
we used. Given the widely observed 3% minimum, however, any such
tree is unlikely to be significantly better, especially for the
training set we used.

4.2. Testing Star-Galaxy Separation and Probability Cuts

Once the optimal decision tree parameters were quan- 
tified, we constructed and subsequently tested this deci- 
sion tree using unseen data to give a more realistic idea 
of the quality of our classifications. We achieved this re- 
sult by building the tree with the previously described 
parameters, but training on 64% of the DR3 data and 
testing on 16% (i.e., maintaining the optimal 80:20 ra- 
tio). We tested this decision tree in a pseudo-blind fash- 
ion by using the remaining and unseen 20% of the DR3 
data, which consisted of 95,413 sources. From this clas- 
sification test, we compared the assigned types galaxy, 
star and nsng to the true types, in terms of completeness 
and efficiency as a function of the three probabilities p_g,
p_s, and p_n.

Overall, this experiment demonstrated that our axis 
parallel decision tree is very successful at classifying ob- 
jects into these three types, and, as a result, at star- 
galaxy separation. The vast majority of objects are classified
correctly and given high probabilities for the appropriate type.
The confusion matrix is shown in Figure 21 and tabulated in
Table 3. Figure [SI compares the
true fraction of correct classifications as a function of 
assigned probability from the tree, demonstrating that 
these are approximately correct. We note that there is 
some discretization in the probabilities due to the po- 
tentially small number of objects in a leaf node. As 
can be seen in Figure ^ however, increasing the min- 
imum decomposition population to the point at which 
the probabilities would not be discretized would worsen 
the decision tree performance. 

In general the function for generating subsamples ac- 
cording to probability is 

f(p) \equiv f(p_g, p_n, p_s).    (7)

In all subsequent tests, we analyze a subsample in terms 
of true positives (TP), false positives (FP), true nega- 
tives (TN) and false negatives (FN) for each of the three 
target types separately. We define the completeness, c, 
as the number of the target type that are included in the 
f(p) cut compared to the total number of objects of that
target type in the whole sample: 

c = \frac{TP_i}{TP_{all} + FN_{all}},    (8)

where i indicates the objects to the probability threshold 
and all indicates all the objects in the test set. For example,
in our pseudo-blind test set this would correspond to all 95,413
sources. The efficiency, e, is the fraction assigned a
particular type that are genuinely of that type:

e = \frac{TP_i}{TP_i + FP_i}.    (9)

The completeness varies between 0 and 1 and in general
increases at the expense of efficiency. An object can be 
TP or FN depending on the probability threshold. 

We define the best sample for a specific test to be that 
which maximizes the metric: 

best = \max\left[ \frac{1}{\sqrt{2}} \left( c^2 + e^2 \right)^{1/2} \right].    (10)
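
The definitions of completeness and efficiency, and the search for the best sample, can be illustrated with a short sketch. It assumes that the metric is the quadrature mean of c and e, sqrt((c^2 + e^2)/2), so that a perfect sample scores 1; the normalization is an assumption, as are the function names and the strict `p > threshold` selection. The scan over one hundred evenly spaced thresholds mirrors the binning mentioned in the text.

```python
import math


def completeness_efficiency(probs, truths, target, threshold):
    """probs: probability assigned to `target` for each object;
    truths: true type of each object."""
    tp = sum(1 for p, t in zip(probs, truths) if p > threshold and t == target)
    fp = sum(1 for p, t in zip(probs, truths) if p > threshold and t != target)
    n_target = sum(1 for t in truths if t == target)  # TP_all + FN_all
    c = tp / n_target if n_target else 0.0
    e = tp / (tp + fp) if tp + fp else 0.0
    return c, e


def best_threshold(probs, truths, target, n_bins=100):
    """Scan thresholds k / n_bins and return (score, threshold, c, e) for the
    cut maximizing the assumed metric sqrt((c^2 + e^2) / 2)."""
    best = (-1.0, None, None, None)
    for k in range(n_bins):
        thr = k / n_bins
        c, e = completeness_efficiency(probs, truths, target, thr)
        score = math.sqrt((c * c + e * e) / 2.0)
        if score > best[0]:  # ties keep the lowest threshold
            best = (score, thr, c, e)
    return best
```

Because the same object can be a true positive at one threshold and a false negative at another, both quantities must be recomputed for every candidate cut.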

All the ratios, but not the simple probabilities, exclude 
objects for which any of the three probabilities are zero. 
In our pseudo-blind test, 18,989 of the 95,413 sources, or 
roughly 20%, were affected in this manner. We used the 
logarithm of a specific probability ratio, given the large 
dynamic range of the ratios of the three probabilities.
In all cases the classification is taken to be the object type 
assigned the highest probability by our decision tree. If 
the highest probability is less than 50% it is possible that 
two types are assigned an equal probability, which could 
introduce a systematic effect in our classification. We do 
not expect this to be problematic in practice, however, 
as this effect is small, only affecting one object in our 
pseudo-blind testing, and excluding these sources is easy 
after classification has been performed. 

To determine the best probability function, i.e., f(p),
and associated probability values to use when assigning
classifications, we computed a number of different functions
involving the three probabilities assigned by our
decision tree. Table 4 provides a compilation of these
different functions and a characterization of their distribution
sampled across one hundred bins. From the
values in Table 4 we find that by using the simple minima
in the three class probabilities we obtain the most
accurate classification samples in our pseudo-blind testing.
These probability cuts are p_g > 0.49, which results
in 98.9% completeness and 98.3% efficiency for galaxies,
p_n > 0.54, which results in 88.5% completeness and
94.5% efficiency for nsng, and p_s > 0.46, which results in
91.6% completeness and 93.5% efficiency for stars. We
find that a number of probability cuts near fifty percent
produce similar classification accuracies. This is not
surprising: fifty percent is a natural threshold, because an
object whose probability for a given type falls below this
value can have a higher probability of being one of
the other two types.

In some cases, ratios of probabilities are able to select 
better samples when a maximum value is also used as 
opposed to only using a minimum value. For example, 
for the probability ratio f(p) = log(p_g/p_s), the sample
is dominated by stars at the low end and galaxies at the
high end. So the nsng are only selected by requiring a
minimum and a maximum f(p). In the end, however,
none of the ratio cuts we have studied provided a better 
combination of completeness and efficiency than the sim- 
ple probability cuts. We interpret this result as further 
evidence that our decision tree is successful, especially 
within the regime of spectroscopically confirmed objects 
used for training (i.e., the interpolation regime). The 
full completeness-efficiency curves, as a function of p_g,
p_n, and p_s, are shown in Figure
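
A ratio cut with both a minimum and a maximum, as discussed above for log(p_g/p_s), might look like the following sketch. The base-10 logarithm and the function names are assumptions; the exclusion of any object with a zero probability follows the rule stated earlier for all ratio-based samples.

```python
import math


def log_ratio(p_num, p_den):
    """log10(p_num / p_den), or None when either probability is zero."""
    if p_num <= 0.0 or p_den <= 0.0:
        return None
    return math.log10(p_num / p_den)


def select_by_ratio(objects, f_min, f_max):
    """objects: iterable of (p_g, p_n, p_s) tuples.
    Keep objects with f_min <= log10(p_g / p_s) <= f_max."""
    selected = []
    for p_g, p_n, p_s in objects:
        if min(p_g, p_n, p_s) <= 0.0:  # ratios exclude any zero probability
            continue
        f = log_ratio(p_g, p_s)
        if f_min <= f <= f_max:
            selected.append((p_g, p_n, p_s))
    return selected
```

With a window centered on zero, objects for which the galaxy and star probabilities are comparable are retained, which is where the nsng sources fall for this particular ratio.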

4.3. Application to the SDSS Photometric Catalog

We constructed our final decision tree from the full
(i.e., 100%) spectroscopic training sample from the
SDSS DR3. By using the data-streaming modules avail- 
able within the D2K environment, we applied this de- 
cision tree to the full SDSS DR3 photoPrimary cata- 
log of 142,705,734 sources. With this paper, we pub- 
licly release the classification probabilities for every one 
of these sources. Following the classification guidelines 
discussed in the previous section, within the magnitude 
range 0 < r < 40, this sample consists of 38,022,597
galaxies, 57,369,754 stars and 47,026,955 nsng. The sum
of these subsample sizes does not match the size of the full
sample because of (a) the magnitude restriction, which 
removes 589 objects, and (b) source classifications that 
are ambiguous due to equal-highest probabilities for mul- 
tiple object types, which removes 285,839 (0.2%) objects. 
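
As a quick consistency check, the class counts and the two groups of excluded objects quoted above can be summed and compared with the size of the photoPrimary catalog. The variable names are illustrative only.

```python
# Counts quoted in the text for the classified photoPrimary catalog.
counts = {"galaxy": 38_022_597, "star": 57_369_754, "nsng": 47_026_955}
outside_magnitude_range = 589
ambiguous = 285_839
catalog_size = 142_705_734

# Classified objects plus the two excluded groups.
total = sum(counts.values()) + outside_magnitude_range + ambiguous

# Share of the catalog with equal-highest probabilities, in percent.
ambiguous_percent = 100.0 * ambiguous / catalog_size
```

The total reproduces the catalog size exactly, and the ambiguous objects make up about 0.2% of it.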

One of the most important issues with any supervised 
classification effort is the application of an algorithm 
trained on one sample of data to a different set of data. 
This concern is generally framed in terms of interpolation 
versus extrapolation, and in our case results from the ap- 
plication of our decision tree that is trained on sources 
from the SDSS spectroscopic sample being applied to the 
SDSS photometric sample, which is, in some cases, considerably
fainter than our training set. Given the strong
dependence of the number counts of galaxies and stars
on apparent magnitude, this affects the vast majority
of the sources we have classified. Without deeper training
data, however, we feel that the best approach is to
classify all sources. This approach allows anyone using 
our classification catalog to create a well-defined sample 
for further study or follow-up spectroscopy that does not 
have any selection effects artificially imposed due to our 
classification algorithm. 

While we cannot guarantee the accuracy of our clas- 
sifications across the full SDSS photometric sample, we 
can provide some guidance on the accuracy of specific 
sample restrictions. Historically, one of the most powerful
techniques for characterizing classifications in the
absence of spectroscopic identifications has been to analyze
the differential number of sources as a function of apparent
magnitude (i.e., number count plots). This technique
is simple and straightforward; by comparing the number 
count distribution of training sources with the number 
count distribution of classified sources, we gain a statis- 
tical estimate of the apparent magnitude limit at which 
we can no longer expect our extrapolated classification 
rules to remain robust. 
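
A number count comparison of this kind can be sketched in a few lines. The bin width of 0.5 mag and the function names are assumptions; the sketch bins magnitudes into differential counts and fits the logarithmic slope, d log10(N) / dm, by ordinary least squares over the non-empty bins.

```python
import math


def number_counts(magnitudes, m_min, m_max, bin_width=0.5):
    """Differential counts in bins of `bin_width` magnitudes."""
    n_bins = int(round((m_max - m_min) / bin_width))
    counts = [0] * n_bins
    for m in magnitudes:
        k = int((m - m_min) // bin_width)
        if 0 <= k < n_bins:  # ignore objects outside the range
            counts[k] += 1
    centers = [m_min + (k + 0.5) * bin_width for k in range(n_bins)]
    return centers, counts


def log_slope(centers, counts):
    """Least-squares slope of log10(N) versus magnitude over non-empty bins."""
    pts = [(m, math.log10(n)) for m, n in zip(centers, counts) if n > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx
```

Applying the same two functions to the training set and to the classified sample gives the slopes whose divergence marks the magnitude at which the extrapolation stops being trustworthy.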

We present, therefore, in Figure |3 the number 
counts for both the training set and for the classified 
photoPrimary catalog for our three different classifica- 
tions: galaxy, nsng, and star. The SDSS spectroscopy 
is limited to r < 17.77 for galaxies (Strauss et al.
2002), which can clearly be seen in the galaxy training
set number counts. Similarly, the nsng training set,
which is dominated by quasars, is limited to i < 19.1
(Richards et al. 2002), which is reflected in the decline
in the r-band number counts around r ~ 19. The logarithmic
slopes of the galaxies and nsng are approximately
0.5 and 0.6 respectively for the training set. Stars, on the 
other hand, have a more heterogeneous selection scheme 
in the SDSS survey, with many subpopulations, which is
reflected in the number count distribution of the stellar
training set. 

We see clear evidence for reliable star-galaxy separa- 
tion in Figure 13 where the nsng counts remain consider- 
ably lower than the star and galaxy counts until r < 20. 
Given the spectroscopic limits of the SDSS, this figure 
suggests that the photoPrimary classifications are reli- 
able to substantially fainter magnitudes than the spec- 
troscopic sample. The galaxy counts approximately fol- 
low the logarithmic slope for around two magnitudes 
fainter than the training sample, and the nsng counts 
extend even fainter. The inflection points in the counts 
at r ~ 20.5 suggest that our classifications are reliable to
this magnitude. Fainter than this, the tree becomes less
reliable and increasingly assigns objects to nsng, which
is one of the primary reasons we included this third
classification in our algorithm.

The SDSS photometric sample does extend fainter, the 
95% detection repeatability limits for point sources be- 
ing u < 22.0, g < 22.2, r < 22.2, i < 21.3 and z < 20.5, 
which is seen in the number count distributions as the 
abrupt turnover at r > 22. If, as the number counts 
suggest, our classifications are reliable to r < 20, the
numbers of reliable classifications are 7,978,356, 832,993,
and 13,945,621 for the galaxy, nsng, and star classes,
respectively, totaling 22,756,970. We must emphasize, however,
that the number of sources that will actually be
scientifically useful will be lower, as these samples have 
not been cleaned. For example, before using these data, 
appropriate cuts should be made on either photometric 
errors, detection flags, or both, as described more fully 
on the SDSS project website. 

Since we used colors, as opposed to magnitudes, to 
classify our sources, we also compared, as shown in Fig- 
ure |S1 the g - r distribution of training set sources with
the photoPrimary classified sources. The extrapolation
in g - r color in going from the testing set to the classified
sources is considerably less than the corresponding
extrapolation in magnitude, although it is still present, 
particularly in the nsng class toward redder colors. As 
expected for the training set, galaxies and stars show 
a bimodal distribution in color and the nsngs are domi- 
nated by blue objects (e.g., UVX-selected quasars). This 
bimodality is washed out for galaxies in our classified 
sample, however, presumably due to the fact that the 
photometric sample has a higher mean redshift than the 
spectroscopic sample. We note that this difference in 
mean redshift will not affect the apparent magnitude 
number count distributions, as the redshift difference is 
too small to allow for significant evolution to occur be- 
tween the two samples. 

In Figure]^ we present a visualization of the three clas- 
sification probabilities for all objects in photoPrimary, 
demonstrating how the three values interact. Since the 
three probabilities must always sum to unity, their distri- 
bution appears as a triangle, with the height correspond- 
ing to the density of objects in that region of probability 
space. One can see that galaxies and stars are assigned 
relatively unambiguously, but that the tree is generally 
less certain when an object is neither a star nor a galaxy, 
as the base of the triangle rises noticeably before reaching
p_n = 1. Discretization is also seen in this figure, but
is insignificant compared to the numbers of each source 
type with high probability. The data used to produce the 
visualization are available to download with the catalog.
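
Because the three probabilities sum to unity, each object can be placed inside a triangle and the object density accumulated on a grid. The sketch below shows one way to do this; the placement of the vertices (galaxy at the origin, star at (1, 0), nsng at the apex) and the grid resolution are assumptions and need not match the layout of the published figure.

```python
import math
from collections import Counter


def ternary_xy(p_g, p_n, p_s):
    """Map (p_g, p_n, p_s) onto an equilateral triangle with vertices
    galaxy = (0, 0), star = (1, 0), nsng = (0.5, sqrt(3)/2)."""
    total = p_g + p_n + p_s
    if total <= 0.0:
        raise ValueError("probabilities must have a positive sum")
    p_n, p_s = p_n / total, p_s / total  # renormalize against rounding
    x = p_s + 0.5 * p_n
    y = (math.sqrt(3.0) / 2.0) * p_n
    return x, y


def density(points, n_bins=50):
    """Count objects per grid cell; the counts play the role of the height."""
    grid = Counter()
    for p_g, p_n, p_s in points:
        x, y = ternary_xy(p_g, p_n, p_s)
        ix = min(int(x * n_bins), n_bins - 1)
        iy = min(int(y * n_bins), n_bins - 1)
        grid[(ix, iy)] += 1
    return grid
```

Objects classified unambiguously pile up in the cells at the three corners, while uncertain objects spread along the edges and into the interior.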
4.4. Blind Testing 

In the pseudo-blind test results described earlier
(§ 4.2), our decision tree provided robust classifications
for data that was not used to train the decision tree, nor 
to characterize its performance. While this pseudo-blind 
test did provide a good approximation to a blind test, a 
true blind test is to compare the decision tree classifica- 
tions to sources that have been assigned a spectroscopic 
identification from another survey. We have, therefore, 
matched the SDSS DR3 photometric data to the 2dF- 
GRS and 2QZ surveys, by using coordinate position with 
a tolerance of 2 arcseconds. The 2dFGRS resulted in 
50,191 matches and the 2QZ 10,259 matches. The 2dF- 
GRS contains mainly galaxies and the 2QZ stars and 
quasars with a small number of mainly narrow emission- 
line galaxies. 
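
A positional match with a 2 arcsecond tolerance can be sketched as below. This is a brute-force illustration with hypothetical function names, not the matching code used for the catalog; for catalogs of the sizes discussed here a spatial index would be needed in practice. Separations are computed with the haversine formula, which is well behaved at small angles.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds; inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def match_catalogs(cat_a, cat_b, tolerance_arcsec=2.0):
    """For each (ra, dec) in cat_a, find the nearest entry of cat_b within
    the tolerance. Returns a list of (index_a, index_b, separation)."""
    matches = []
    for i, (ra_a, dec_a) in enumerate(cat_a):
        best_j, best_sep = None, tolerance_arcsec
        for j, (ra_b, dec_b) in enumerate(cat_b):
            sep = angular_separation_arcsec(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_j, best_sep = j, sep
        if best_j is not None:
            matches.append((i, best_j, best_sep))
    return matches
```

Keeping only the nearest counterpart within the tolerance avoids assigning one spectroscopic identification to several photometric objects on the same side of the match.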

By applying the optimal cuts of p_g > 0.49, p_n > 0.54
and p_s > 0.46, as detailed in § 4.2, we can characterize
our classification accuracy in this blind test as a func- 
tion of apparent magnitude. We show these results for 
the 2dFGRS galaxies in Figure [TOI for the 2QZ stars in 
Figure ^2 For galaxies, the overall completeness of the 
2dFGRS galaxies is 93.8% (47,055 of 50,191), with the ef- 
ficiency undefined as this is the only type of object. The 
completeness of the stars in the 2QZ match is rather low, 
at 79.7% (4141 of 5193), but the efficiency is 95.4% (4141 
of 4359). Figure ITTI shows that the low completeness is 
due to stars at r > 19. The low value may be due to the 
fact that the 2QZ is UVX-selected, which causes unusual 
stars that lie off the main-sequence, such as sub-dwarfs, 
halo stars, and hot young stars, to be preferentially tar- 
geted. 

We also blind-tested the nsng classifications in a simi- 
lar manner to the galaxy and star classes, but in this case 
by using quasars from the 2QZ with the same matching 
criteria as the one we used for stars from the 2QZ. As 
shown in Figure 1121 our decision tree does an excellent 
job of assigning quasars to the nsng class, with a completeness
of 94.7% (8278 of 8739) and an efficiency of
87.4% (8278 of 9471) for the type 11 best IDs and with 
a completeness of 94.4% and an efficiency of 85% for the 
12, 21 and 22 next-best IDs. The efficiency drops off 
from around 98% fainter than the SDSS spectroscopic
limit of i < 19.1, and is about 87% at r ~ 20.5 when the
counts reach their limit in the 2QZ. This result is further 
vindication of adding the third class to our algorithm, 
as these sources are clearly neither stars nor galaxies, 
and our algorithm identifies them appropriately. Over- 
all, these blind tests support our assertion, based on the 
number count distributions, that our classifications re- 
main reliable to r ~ 20. 

5. DISCUSSION 

5.1. Star-Galaxy Separation and Extrapolation 

The axis-parallel decision tree algorithm we have 
adopted in this work is an example of supervised ma- 
chine learning. One of the primary criticisms of super- 
vised techniques is their difficulty in extrapolating past 
the limits of the actual data used in their construction. 
Supervised learning algorithms can be viewed as a mapping
between the training parameter space and a classification,
and are not necessarily a representation of
anything physical (e.g., a decision tree might correctly
classify stars, but it would not necessarily reproduce the 
Hertzsprung-Russell diagram). Ideally, the training data 
should sample, perhaps sparsely, the entire parameter 
space occupied by the data to be classified. In the case 
of astronomical datasets, such as the star-galaxy sepa- 
ration problem being studied in this paper, this is of- 
ten impossible. We do not have a sufficiently large and 
faint spectroscopic survey to adequately sample either 
the magnitude or color spaces spanned by the photomet- 
ric data from the SDSS survey. 

Yet we see that our classifications do reliably extrapo- 
late past the training data. What is most likely driving 
this successful extrapolation is that the source colors do 
not change dramatically prior to r ~ 20. As a result, 
the existing training data is sufficient to classify sources 
in this area of parameter space. At these fainter magni- 
tudes, the algorithm increasingly assigns sources to the 
nsng class, as the distinction between stars and galaxies 
is breaking down with decreasing signal-to-noise. We can 
characterize this effect from the classifications in the fi- 
nal catalog, where the fraction of nsng sources in the full 
catalog rises from approximately 1.3% for r < 18 to ap- 
proximately 3.7% for r < 20. This observation, together 
with the obvious examples of sources that are neither 
stars nor galaxies, strengthens the case for introducing 
a third class into the historical star-galaxy separation 
problem. 
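
The nsng fraction for r < 20 can be checked directly from the class counts quoted in Section 4.3 for that magnitude range. The variable names are illustrative only.

```python
# Classified objects with r < 20, as quoted in Section 4.3.
reliable = {"galaxy": 7_978_356, "nsng": 832_993, "star": 13_945_621}

total = sum(reliable.values())
nsng_percent = 100.0 * reliable["nsng"] / total
```

The result agrees with the approximately 3.7% quoted in the text.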

To aid in the utilization of our classification catalog, 
we have made available classifications for all sources in 
the SDSS DR3, without regard for photometric errors, 
or object detection flags. As a result, anyone wanting to 
use our catalog must subsequently apply any flag or
photometric error restrictions appropriate for
their science, just as they would when using the SDSS
databases directly. We feel that this is an important
step forward in the realm of astronomical source classifications.
While we are certainly cognizant that a large
fraction of our source classifications are not highly reli- 
able, we feel it is important in the new era of large surveys 
to provide an astronomer with the greatest possible free- 
dom in selecting their own samples. If we released only 
those sources that pass a potentially opaque classifica- 
tion cut, we force all users of our catalog to deal with an 
additional, implicit selection effect when creating their 
own subsamples from our classification catalog. 

5.2. Future Work 

In this paper, our supervised learning algorithm uti- 
lized photometry and spectra from the Sloan Digital Sky 
Survey. While the SDSS is an incredible resource for 
this sort of investigation, it is limited to the optical and 
near-infrared wavelengths, and, consequently, physical 
processes outside these wavelengths that might distin- 
guish objects otherwise indistinguishable in the optical 
will not be seen. Given the growth in survey astron- 
omy, a large number of SDSS objects have been observed 
in surveys at other wavelengths, and thus an important 
next-step is to include these in the training set. This 
extension has two stages: 1) match the SDSS photome- 
try to spectroscopic targets from other surveys and train 
on the SDSS photometry with these augmented train- 
ing data, and 2) add the photometry from other surveys 
as additional training parameters. The latter is a substantial
task as the cumulative number of photometric
objects in surveys such as GALEX (Martin et al. 2005),
2MASS (Skrutskie et al. 2006), SWIRE (Lonsdale et al.
2003) and ROSAT (Voges et al. 1999) is in the millions.
We are in the process of matching the SDSS to
these surveys, and others that may become available such 
as UKIDSS, to improve the classification performance
achieved here. Of course this technique also extends to 
subsequent data releases from the SDSS itself. 

We also would like to investigate alternative mech- 
anisms for selecting SDSS photometric parameters for 
training, including the use of techniques such as forward- 
selection, backward-elimination, or a hybrid of the two, 
in which training parameters are respectively added or 
removed according to their effect on the quality of the 
classification. This approach is adopted by Bazell (2000),
who uses this technique for morphological galaxy classi- 
fication. The training parameters could utilize magni- 
tudes rather than colors, and also morphological param- 
eters such as the probPSF measure. In this paper, we 
have only used the colors of objects through four dif- 
ferent matched apertures, as this work is our first major 
effort in large-scale machine classification of astronomical 
datasets. In the future, we anticipate performing a more 
detailed exploration of different classification parame- 
ters, especially as we incorporate additional datasets at 
other wavelengths into our algorithms. 

One training parameter that was found to not be use- 
ful in the current framework is the error on an object 
attribute, in agreement with previous analyses (see, e.g.,
Ball et al. 2004). Fundamentally, this is because, without
additional information, the decision tree cannot differentiate
between an error and another parameter. Coupled
with the different dynamic ranges of the two types
of attributes, this effect essentially trivializes the error 
attributes. Since irrelevant attributes dilute the real in- 
formation content in a training dataset, they reduce the 
performance of a decision tree. As a result, we did not 
include magnitude errors in our analysis. Intellectually, a 
better technique might be to include the error as part of 
the appropriate attribute when constructing a machine 
learning algorithm. Our decision tree implementation, 
however, does not provide this capability, and we there- 
fore defer this concept to a future paper. 

A further decision tree algorithm, supported in the 
Data-to-Knowledge framework but not yet extensively 
investigated, is boosting, in which the differences between 
classes are emphasized. This method has been shown to 
improve results by many workers in the data mining field
and is described further in Hastie et al. (2001). Besides
decision trees, D2K supports numerous other algorithms, 
including naive Bayes, artificial neural networks, support 
vector machines, k-means clustering, and instance-based 
learning. The latter is of particular interest because, al- 
though it has shown promising results, its full potential 
has not been realized, due to the computational require- 
ments of calculating the classifications for the full appli- 
cation set. 

In this method, the distance between each test or ap- 
plication set object (each instance) and the nearest n 
neighbors in the training set is measured and a weighted 
mean is taken to give the new object its assigned type. 

UKIDSS: http://www.ukidss.org



Thus the training consists of specifying n, the weighting, 
and the distance measure, typically of the form in 
each dimension where d is the distance and x = 2 would
be the Euclidean distance. Hence there are a large num- 
ber of distance calculations involved. Typically the clas- 
sification of an object takes of the order of a second on 
a desktop machine, rendering the process computation- 
ally intractable for the 143 million objects in the SDSS 
DR3. The process is, however, embarrassingly parallel, 
so is amenable to a significant speedup by using supercomputing
resources. In that case, this technique could
potentially be used to not only classify sources, but to 
characterize the energy generation mechanisms responsi- 
ble for a source's luminosity. We anticipate applying this 
technique, for example, to characterize the fraction of ev- 
ery source's luminosity in terms of the ratio between the 
thermonuclear fusion powering stars and the accretion 
powering quasars. 
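
The instance-based method described above can be sketched as a weighted nearest-neighbor vote. The exact form of the distance measure is not recoverable from the text, so the sketch assumes a Minkowski distance with exponent x (x = 2 being Euclidean) and inverse-distance weighting; both choices, and the function names, are assumptions.

```python
from collections import defaultdict


def minkowski(a, b, x=2.0):
    """Minkowski distance with exponent x; x = 2 is the Euclidean distance."""
    return sum(abs(u - v) ** x for u, v in zip(a, b)) ** (1.0 / x)


def instance_based_probs(train, obj, n=20, x=2.0, eps=1e-12):
    """train: list of (feature_tuple, label). Returns class probabilities for
    `obj` from its n nearest training instances."""
    neighbours = sorted((minkowski(f, obj, x), label) for f, label in train)[:n]
    weights = defaultdict(float)
    for d, label in neighbours:
        weights[label] += 1.0 / (d + eps)  # assumption: inverse-distance weights
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}
```

Every classification requires a distance to each training instance, which is the cost referred to in the text; since each object is handled independently, the loop over objects can be distributed across processors without communication.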

In many applications it has been found that a mix- 
ture of experts, that is, the combination of results from 
more than one learning method, is best, in an analo- 
gous way to the improvement seen here through the use 
of bagging. A simple way to combine results is to take 
a weighted sum of outputs from the different methods. 
Preliminary tests have shown that instance-based classification
with n = 20 and x = 6 in a 0.7:0.3 combination with a decision
tree gives improved results. Of course, one must be
able to run a full instance-based classification to explore 
the potential of combining it with the other methods, 
which heretofore has been impossible due to the necessary
computing resources. Another promising future
method is semi-supervised classification (Bazell & Miller
2005), in which the algorithm is given some predefined
classes but is able to discover further classes in the data 
for itself. This combines prior knowledge from spectra 
with the large number of objects for which there is avail- 
able photometry and may perform better where very few 
spectra are available. 
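
The weighted sum of outputs mentioned above for combining two learning methods can be sketched as follows. Reading the 0.7:0.3 combination as a weight of 0.7 on the instance-based output and 0.3 on the decision tree output is an assumption, as is the function name.

```python
def combine(probs_a, probs_b, weight_a=0.7, weight_b=0.3):
    """Weighted sum of two sets of class probabilities, renormalized to 1."""
    classes = set(probs_a) | set(probs_b)
    mixed = {c: weight_a * probs_a.get(c, 0.0) + weight_b * probs_b.get(c, 0.0)
             for c in classes}
    total = sum(mixed.values())
    if total == 0.0:  # neither method assigned any probability
        return {c: 1.0 / len(classes) for c in classes}
    return {c: v / total for c, v in mixed.items()}
```

The combined probabilities can then be passed through the same probability cuts used for the decision tree alone.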

Finally, in this work we limited our analysis to three 
classes as our main interest was robust star-galaxy sep- 
aration. Given the diversity of both stars and galax- 
ies, and the ignorance of our approach to non-stellar 
luminosity, we could clearly benefit from adding additional
classes to our existing decision tree work (see,
e.g., Suchkov et al. 2005). As it stands, the nsng class is
dominated by quasars at bright magnitudes, and could, 
therefore, serve as the basis for a quasar class that we 
expect would provide reasonable quasar classifications 
for relatively bright sources with clean photometry. Ide- 
ally, this approach could be extended to include other 
classes for different types of stars and galaxies, for exam- 
ple, white dwarfs, hot stars, cool dwarfs, starbursts, and 
active galaxies. 

6. CONCLUSIONS 

In this paper we have classified all of the 142,705,734 
non-repeat (primary) objects in the Third Data Release 
of the Sloan Digital Sky Survey (SDSS DR3). The clas- 
sifications were determined by using machine learning, 
the algorithm of choice being the axis-parallel decision 
tree. The algorithm was trained on 477,068 objects for 
which spectroscopic classifications were available within 
DR3. This training data consists of 361,490 galaxies, 
62,333 stars, 49,545 quasars, and 3,700 unknown objects. 
Collaboration with domain experts at the National Cen-
ter for Supercomputing Applications (NCSA) and the 
use of the general machine learning environment Data- 
to-Knowledge combined with supercomputing resources 
enabled extensive investigation of the decision tree pa- 
rameter space and the associated datasets. To our knowl- 
edge, this level of investigation has not previously been 
published in the astronomy literature. 

This is the first public release of objects classified in 
this way for the whole survey. The objects are clas- 
sified as either galaxy, star or nsng, and we provide 
an associated probability for each class. These three 
probabilities always sum to unity for every object clas- 
sified. By merely assigning the classification with the 
highest probability, we find that our full classification 
sample contains 38,022,597 galaxies, 57,369,754 stars 
and 47,026,955 nsng in the magnitude range 0 < r <
40, with 589 outside this range and 285,839 (0.2%)
ambiguous due to equal-highest probabilities for object
type. The catalog is available for download at
http://quasar.astro.uiuc.edu/rml

A major issue with this method of classification is the 
inevitable extrapolation from the spectroscopic regime. 
We investigate this by examining the apparent magni- 
tude number counts for all sources and by performing 
several pseudo-blind or fully blind tests by using sources 
spectroscopically identified in the SDSS, 2dF Galaxy 
Redshift, and 2dF QSO Redshift (2QZ) surveys. We find 
that for a sample giving the optimal completeness and ef- 
ficiency, the completeness values are 98.9%, 91.6%, and 
88.5% for galaxies, stars, and nsngs in the SDSS, and 
94.7% for quasars, which are classified as nsng, in the 
2QZ. The corresponding efficiencies are 98.3%, 93.5%, 
94.5%, and 87.4%. The number count distributions sug- 
gest that the classifications are reliable for r < 20, giving 
7,978,356 galaxies, 832,993 nsng, and 13,945,621 stars. 
As we have not applied any restrictions to the classifi- 
cation catalog, such as limiting by photometric error or 
detection flags, the number of reliable sources will be 
lower; however, we provide these classifications to elimi- 
nate any opaque selection effects from our classification 
catalog, which simplifies the task of using these classifications
to supplement additional analyses, such as
defining a target sample for follow-up spectroscopy. 

The assignment of probabilities to each object allows 
one to investigate the completeness and efficiency of the 
classifications as a function of these probabilities. While 
ratios of the probabilities were investigated, we find that 
the samples with an optimal completeness and efficiency 
are yielded by the simple cuts p_g > 0.49, p_n > 0.54
and p_s > 0.46. Other values close to 0.5 also give very
similar results. The full data describing the complete- 
ness and efficiency as a function of these three threshold 
probabilities is made available with the catalog. 

We feel that the application of machine learning al- 
gorithms to large, astronomical surveys is a rich area 
of research. We are currently augmenting our training 
data with both additional wavelengths and fainter spectroscopic
identifications. To improve our classification
results, we are also performing additional tests to determine
the optimal parameters for decision trees built
from these data. Given the efficacy of our approach, 
we plan to increase the number of training classes used 
in our analysis, first by using just the SDSS, and later 
by using additional photometric and spectroscopic data. 
Finally, we are also exploring the efficacy of more pow- 
erful algorithms, such as instance-based classification, to 
tackle the fundamental goal of characterizing sources by 
the fraction of their energy that is derived from fusion 
and accretion. These results will be presented in subsequent
papers in this series. 

TABLE 1

Mappings from the SDSS DR3 object spectroscopic type to decision tree target type.

DR3 specClass   No. Objects   Target Type   No. Objects
galaxy              361,490   galaxy            361,490
star                 42,101   star               62,333
star_late            20,232
unknown               3,700   nsng               53,245
qso                  46,275
hiz_qso               3,270

Note. For the training, 13 objects (12 galaxies and 1
quasar) with clearly unphysical outlying values were removed.
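
The mapping in Table 1 can be written as a small lookup table, which also makes the target-type totals easy to verify. The spellings `star_late` and `hiz_qso` are reconstructed from the damaged table text and should be treated as assumptions, as should the function and variable names.

```python
# Spectroscopic class -> decision tree target type (Table 1).
SPEC_CLASS_TO_TARGET = {
    "galaxy": "galaxy",
    "star": "star",
    "star_late": "star",
    "unknown": "nsng",
    "qso": "nsng",
    "hiz_qso": "nsng",
}

# Number of training objects per spectroscopic class (Table 1).
SPEC_CLASS_COUNTS = {
    "galaxy": 361_490,
    "star": 42_101,
    "star_late": 20_232,
    "unknown": 3_700,
    "qso": 46_275,
    "hiz_qso": 3_270,
}


def target_counts(mapping, counts):
    """Sum the spectroscopic class counts into the three target types."""
    totals = {}
    for spec_class, n in counts.items():
        target = mapping[spec_class]
        totals[target] = totals.get(target, 0) + n
    return totals
```

Summing by target type reproduces the right-hand columns of the table and the overall training set size of 477,068 objects.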

TABLE 2 

Values of decision tree parameters tested during optimization. 

Decision Tree Parameter            Values 
Minimum decomposition population   32768 16384 8192 4096 2048 1024 512 256 128 64 32 16 8 4 2 1 
Maximum tree depth                 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
Minimum error reduction            999999.0 100000.0 10000.0 1000.0 100.0 10.0 1.0 0.1 0.01 0.001 0.0001 0.00001 0.000001 0.0 
One half split                     false true 
Midpoint split                     false true 
Mean split                         false true 
Median split                       false true 
Number of repetitions              1 2 5 10 
Fraction train examples            0.01 0.1 0.3 0.5 0.7 0.8 0.9 0.99 
Number of bagging models           1 10 20 50 
Bagging fraction                   0.001 0.01 0.1 0.3 0.5 0.7 0.8 0.9 0.99 0.999 
Random seed subsamples             1 ... 32 
Random seed bagging                1 ... 32 


Note. — Not all combinations of these parameters were tested. See JO] 
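
To give a sense of the size of this parameter space, the sketch below enumerates the full grid for three of the parameters in Table 2 only. It is an illustration with hypothetical variable names; as the note states, the actual search did not cover every combination.

```python
from itertools import product

# Values copied from Table 2 for three of the parameters.
MDP_VALUES = [2 ** a for a in range(15, -1, -1)]  # 32768 down to 1
MTD_VALUES = list(range(1, 21))                   # depths 1 to 20
MER_VALUES = [
    999999.0, 100000.0, 10000.0, 1000.0, 100.0, 10.0, 1.0,
    0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001, 0.0,
]

# Every (MDP, MTD, MER) combination; each would correspond to one tree.
grid = list(product(MDP_VALUES, MTD_VALUES, MER_VALUES))
n_trees = len(grid)
```

Adding the remaining parameters multiplies this count further, which is why sampling the grid selectively is a natural choice.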

TABLE 3 

Confusion matrix for the SDSS DR3 blind 
testing set. 

                             Target 
             Assigned   galaxy     nsng     star 
Number       galaxy     71,748      608      735 
             nsng          397    9,514      309 
             star          345      451   11,305 
Percentage   galaxy       75.2    0.637    0.770 
             nsng        0.416     9.97    0.324 
             star        0.362    0.473     11.8 


Note. — From a decision tree trained and 
tested on non-overlapping samples from an 80% 
random subsample of the training data and subsequently 
applied to the other 20%. There are 
95,412 objects. One object was given equal 
P(galaxy) and P(nsng) and is thus excluded. 
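
Completeness and efficiency per class can be read directly off a confusion matrix of this kind. The following Python sketch is a hedged illustration: it assumes that rows are the assigned type and columns the target (spectroscopic) type, which is our reading of the table layout, and it uses the same definitions of completeness and efficiency as the probability cuts. The variable names are hypothetical.

```python
import numpy as np

CLASSES = ("galaxy", "nsng", "star")

# Counts from Table 3. Assumed orientation: rows = assigned type,
# columns = target type.
counts = np.array([
    [71748,  608,   735],
    [  397, 9514,   309],
    [  345,  451, 11305],
])

correct = np.diag(counts)

# Completeness: correctly assigned objects / all objects of that target type.
completeness = correct / counts.sum(axis=0)

# Efficiency: correctly assigned objects / all objects assigned that type.
efficiency = correct / counts.sum(axis=1)

# Each cell as a percentage of all objects in the matrix.
percent = 100.0 * counts / counts.sum()

for name, c, e in zip(CLASSES, completeness, efficiency):
    print(f"{name:6s} completeness={100 * c:.1f}% efficiency={100 * e:.1f}%")
```

These values come from the overall type assignment and need not coincide exactly with those in Table 4, which are obtained from explicit probability cuts.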

TABLE 4 

Probability cuts for the samples which maximize completeness and efficiency. 

f(p)          Target   Best min f(p)   Best max f(p)   Completeness (%)   Efficiency (%)   Average (%) 
P(galaxy)     galaxy            0.49            1.00               98.9             98.3          98.6 
P(nsng)       nsng              0.54            1.00               88.5             94.5          91.5 
P(star)       star              0.46            1.00               91.6             93.5          92.6 
log(Pg/Pn)    galaxy           0.821            4.82               98.2             97.4          97.8 
log(Pg/Pn)    nsng             -4.70          -0.989               76.4             86.2          81.3 
log(Pg/Pn)    star             -2.32            2.06               94.2             35.0          64.6 
log(Pg/Ps)    galaxy           0.770            4.93               97.6             97.5          97.6 
log(Pg/Ps)    nsng             -2.38            1.97               95.3             32.3          63.8 
log(Pg/Ps)    star             -4.32          -0.711               74.3             83.4          78.9 
log(Pn/Ps)    galaxy           -1.62            1.85               96.7             92.6          94.7 
log(Pn/Ps)    nsng              2.19            4.02               56.5             90.0          73.3 
log(Pn/Ps)    star             -4.23            3.15              100.0             7.74          53.9 

Note. — The targets were split into galaxy, nsng, star. The ideal classifier would give a value of 100% in 
the right-hand column. The columns not shown, e.g., P(galaxy) with target nsng, give worse results. The 
inverse ratios give the same samples as those shown. 
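
The statement that the inverse ratios give the same samples follows from log(a/b) = -log(b/a): a cut lo ≤ log(Pg/Ps) ≤ hi selects the same objects as -hi ≤ log(Ps/Pg) ≤ -lo. The sketch below checks this on made-up probabilities. The function name is hypothetical and a base-10 logarithm is assumed, since the base is not stated here.

```python
import numpy as np


def select_by_log_ratio(p_num, p_den, lo, hi):
    """Boolean mask of objects with lo <= log10(p_num / p_den) <= hi."""
    p_num = np.asarray(p_num, dtype=float)
    p_den = np.asarray(p_den, dtype=float)
    log_ratio = np.log10(p_num / p_den)
    return (log_ratio >= lo) & (log_ratio <= hi)


# Toy, made-up probabilities (not catalog data).
p_g = np.array([0.90, 0.60, 0.30, 0.05])
p_s = np.array([0.05, 0.30, 0.60, 0.90])

# Galaxy cut on log(Pg/Ps) from Table 4, and the equivalent cut on the
# inverse ratio log(Ps/Pg) with the limits negated and swapped.
forward = select_by_log_ratio(p_g, p_s, 0.770, 4.93)
inverse = select_by_log_ratio(p_s, p_g, -4.93, -0.770)
same_sample = bool(np.array_equal(forward, inverse))
```

Objects with a probability of exactly zero would give an infinite log ratio and need separate handling, which this sketch omits.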

Fig. 1. — Partiview visualization of the decision tree classification error from training as a function of the minimum decomposition 
population (MDP) and minimum error reduction (MER) for the maximum tree depths (MTDs) shown (including 1, the front extension 
of the plane at the highest classification error). Each vertex of the mesh represents the result from a decision tree. The tree depth 
was memory-limited to 16, but as the depths not shown follow the pattern of those shown it is clear that there will not be significant 
improvement. The error ranges from 24% for a depth of 1 to 3% for a depth of 16. The best MDP is 2 and the best MER is 0. Their 
values are, respectively, $2^{a}$ and $10^{b}$, where $0 \le a \le 15$, $-6 \le b \le 6$, and MER = 0. An MER below zero gives the same result as the single 
layer tree. These and the other decision tree values used are given in Table 2. 

Fig. 2. — As Figure 1 but showing the bagging fraction of subsample examples (FSE) and the training set fraction (FTE) for 50-fold 
bagging. Here one can see the broad flat base corresponding to the many trees which approximately achieve the minimum classification 
error of 3%. Numerous other plots show this broad base, providing evidence that further improvement in performance can only come from 
improved training data. The best FSE and FTE values are both 0.8. The 0.99 FTE is lower due to a very small test set and is not used. 
The high FSEs have worse errors. Results for 10- and 20-fold bagging are similar but with slightly higher classification error and more 
pronounced change at high FSE and FTE. 

Fig. 3. — Normalized histogram of the effect of random seed on the decision tree performance, showing the distribution of the percentage 
classification error for instances of different random seeds in the selection of subsamples for training and for bagging. 32 seeds were chosen 
for each, and the tree for each combination computed. The mean is 3.07%, the standard deviation is 0.08, and a Gaussian with these values 
is overplotted, normalized to the same amplitude as the histogram. 

Fig. 4. — Confusion matrix for the overall assignment of the types galaxy, nsng and star for the SDSS DR3 blind testing set in the 
spectroscopic regime. The three columns in each panel show the numbers of galaxies, nsngs and stars assigned and hence that the decision 
trees successfully assign the types to the objects. The values are tabulated in Table 3. The vertical axes are logarithmic to clarify the 
significant differences between the number of successfully and unsuccessfully classified objects. The percentages in each panel add to 100%. 

Fig. 5. — True fraction of galaxies, nsngs and stars compared to the probability assigned by the decision tree that the object is of that 
type. The solid line shows the ideal result of equal values for each object. 
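
A comparison of this kind can be sketched by binning objects in assigned probability and measuring the true fraction of the class in each bin. The Python below is a minimal illustration under that assumption, not the procedure used to make the figure; the function name, binning and toy values are hypothetical.

```python
import numpy as np


def true_fraction_by_probability(prob, is_type, n_bins=10):
    """Bin objects by assigned probability; return (bin centers, true fraction).

    Bins with no objects yield NaN.
    """
    prob = np.asarray(prob, dtype=float)
    is_type = np.asarray(is_type, dtype=bool)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Clip so that prob == 1.0 falls in the last bin rather than beyond it.
    idx = np.clip(np.digitize(prob, edges) - 1, 0, n_bins - 1)
    n_total = np.bincount(idx, minlength=n_bins)
    n_true = np.bincount(idx, weights=is_type.astype(float), minlength=n_bins)
    with np.errstate(invalid="ignore", divide="ignore"):
        fraction = n_true / n_total
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, fraction


# Toy, made-up values (not catalog data).
prob = [0.05, 0.05, 0.95, 0.95, 0.95, 0.95]
is_star = [False, False, True, True, True, False]
centers, fraction = true_fraction_by_probability(prob, is_star, n_bins=2)
```

For a well-calibrated classifier the returned fractions would lie close to the bin centers, which corresponds to the solid line in the figure.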

Fig. 6. — Completeness and efficiency for galaxies as a function of P(galaxy), P(nsng) and P(star) for the 80:20 blind testing set. The 
optimal samples quoted in §4.2 correspond to the points closest to the upper right of the plot for each object type. The curves extend 
outside the axes shown. 

Fig. 7. — Number counts of the objects as a function of extinction-corrected r band model magnitudes, subdivided by object type galaxy, 
nsng, or star. The lower set of four curves (closed symbols) shows the target types of objects in the training set. The upper set (open 
symbols) shows the full counts of the types assigned by the decision tree to the SDSS DR3. A small number of objects lie outside the axes 
shown. The highest three peaks in the galaxy and nsng training set curves correspond to the cuts in the spectroscopy for galaxies, quasars 
and high-redshift quasars, at magnitudes r = 17.77, r ~ 19, and r ~ 20.5 respectively. The extrapolation in magnitude to the full DR3 is 
clear. 

Fig. 8. — As Figure 7 but showing the objects in extinction-corrected model g - r color. Bimodality in the galaxy and star color 
distributions is clearly seen in the training sets, but is washed out for galaxies in the full DR3 due to the higher mean redshift for the 
photometric sample. The nsngs are dominated by blue objects, as expected. The extrapolation in color space is considerably less than in 
magnitude space but is still present, e.g., the nsngs extend to much redder colors. The colors were used as the training set parameters. 

Fig. 9. — Partiview visualization of the galaxy-nsng-star classification probabilities for the full photoPrimary catalog. The probabilities 
sum to 1 so P(nsng) is equal to 1 at the origin, with P(galaxy) and P(star) ranging from 0 to 1 along their respective axes. The vertical 
axis shows the object counts; note the logarithmic scale. The upper horizontal axes are placed at a count of 10^ objects per bin. The 
clear separation of stars and galaxies is visible, as is the slightly less certain assignation of nsngs. Discretization in the probabilities can 
also be seen but is insignificant compared to the number of objects at high probability for each type. 

Fig. 10. — Completeness as a function of magnitude for 2dFGRS galaxies. The upper panel shows the differential counts for matches 
with the SDSS DR3. The lower panel shows the completeness for the integrated counts. The error bars are Poisson. 

Fig. 11. — As Figure 10 but showing completeness and efficiency for 2QZ stars. 

Fig. 12. — As Figure 11 but for 2QZ quasars. 



